Common terms analyses

In this section we compare the most common terms used by reviewers on Tripadvisor and on Booking. All words were preprocessed with the Snowball stemming library (http://snowballstem.org/), which reduces each word to its stem; for example, "location" becomes "locat".
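As a rough illustration of the counting step, the sketch below computes, for each term, its total occurrences, its relative frequency and the number of reviews that contain it. The function name `term_stats`, the simple tokenizer and the `stem` parameter are assumptions made for this example; in the real pipeline `stem` would be a Snowball stemmer, while here it defaults to the identity so that the code runs without external dependencies. Any stop-word filtering is left out.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase alphabetic runs only.
TOKEN_RE = re.compile(r"[a-z]+")

def term_stats(reviews, stem=lambda word: word):
    """Return {term: (occurrences, freq, n_reviews)} for a list of review texts."""
    occurrences = Counter()  # total number of times each term appears
    n_reviews = Counter()    # number of reviews that contain the term
    for text in reviews:
        terms = [stem(tok) for tok in TOKEN_RE.findall(text.lower())]
        occurrences.update(terms)
        n_reviews.update(set(terms))  # count each review at most once per term
    total = sum(occurrences.values())
    return {t: (c, c / total, n_reviews[t]) for t, c in occurrences.items()}

# Two made-up reviews, only to exercise the function.
reviews = ["Great room, great location", "The room was clean"]
stats = term_stats(reviews)
print(stats["room"])   # occurrences, freq, n_reviews for "room"
```

Sorting the terms by occurrences then gives the ranking used in the comparisons that follow.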

Language comparison

This graph shows, for four languages, how much the most common words differ between the two datasets.

The X axis shows the number of most common words considered; the Y axis shows how many of those words differ between the two datasets.

[Figure: difference by language]

The difference appears to grow linearly over the first 1000 most common words, staying at roughly 25% of the words considered.
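The curve described above can be sketched as follows. This is only one plausible reading of the metric: for each N, take the N most common terms in each dataset and count how many terms from the first set are missing from the second. The function names and the toy counts are made up for the example and are not taken from the actual datasets.

```python
from collections import Counter

def top_terms(counts, n):
    """Set of the n most common terms in a Counter."""
    return {term for term, _ in counts.most_common(n)}

def difference_curve(counts_a, counts_b, sizes):
    """For each n in sizes, count the top-n terms of A that are not in the top-n of B."""
    curve = []
    for n in sizes:
        top_a = top_terms(counts_a, n)
        top_b = top_terms(counts_b, n)
        curve.append((n, len(top_a - top_b)))
    return curve

# Toy term counts (illustrative only).
tripadvisor = Counter({"room": 9, "staff": 7, "locat": 5, "view": 3})
booking = Counter({"room": 8, "locat": 6, "staff": 4, "breakfast": 2})

curve = difference_curve(tripadvisor, booking, [1, 2, 3, 4])
print(curve)
```

When both sets contain exactly N terms, the count is the same in either direction, so the choice of which dataset comes first does not matter. Under this reading, a difference of about 25% at N = 1000 would correspond to roughly 250 terms.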

City comparison

This graph shows the differences between the two datasets for several cities around the world. Only English-language reviews were used for this analysis.

The X axis shows the number of most common words considered; the Y axis shows how many of those words differ between the two datasets.

[Figure: difference by city]

The graph shows that the differences in the most common words are similar for the two Italian cities, and smaller than those for the other two cities.

Wordcloud

The following wordclouds show the 10 most common words for the two datasets for the city of Lucca.

[Figures: Tripadvisor Lucca wordcloud; Booking Lucca wordcloud]

Term Comparison

The following table shows, for Lucca, Paris and New York, the union of the 10 most common terms in each city, together with each term's rank in the other cities. Columns without a suffix refer to Lucca.

term       rank  occurrences  freq      n_reviews  rank_paris  occurrences_paris  freq_paris    n_reviews_paris  rank_newyork  occurrences_newyork  freq_newyork  n_reviews_newyork
bed        22    702          0.005893  624        14          13845              7.800835e-03  12018            8             35418                9.788734e-03  30800
breakfast  3     2054         0.017241  1861       9           19671              1.108344e-02  17480            9             32384                8.950205e-03  27979
clean      14    929          0.007798  875        6           22225              1.252247e-02  20806            7             37615                1.039594e-02  34657
friend     9     1092         0.009166  1052       10          18922              1.066142e-02  18432            11            30685                8.480640e-03  29511
good       5     1550         0.013011  1259       5           28575              1.610031e-02  23423            6             38689                1.069276e-02  32504
great      7     1397         0.011726  1149       11          18139              1.022025e-02  15525            5             47970                1.325782e-02  40049
help       6     1412         0.011852  1326       8           20329              1.145418e-02  19192            13            30231                8.355165e-03  28131
hotel      12    1016         0.008528  745        4           37830              2.131496e-02  26525            4             68529                1.893987e-02  48314
locat      2     2393         0.020087  2235       2           51742              2.915354e-02  48980            2             105237               2.908513e-02  98636
lucca      8     1235         0.010367  1068       12142       1                  5.634406e-07  1                16938         1                    2.763774e-07  1
nice       10    1081         0.009074  917        13          15086              8.500065e-03  13134            15            25627                7.082723e-03  22236
room       1     2979         0.025006  2167       1           72787              4.101115e-02  48725            1             150119               4.148950e-02  97201
small      41    460          0.003861  416        7           20387              1.148686e-02  18024            12            30466                8.420114e-03  27451
staff      4     1761         0.014782  1634       3           41650              2.346730e-02  38949            3             75115                2.076009e-02  68602
time       53    381          0.003198  340        35          6532               3.680394e-03  5750             10            32304                8.928095e-03  27412

The table shows that the most frequent words overlap considerably across cities: 11 of the 15 terms rank within the top 15 in all three cities.
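The overlap count can be checked directly from the ranks in the table. The short sketch below copies the three rank columns (Lucca, Paris, New York) and keeps the terms whose worst rank is at most 15; the variable names are chosen for this example only.

```python
# Ranks copied from the table: (Lucca, Paris, New York).
ranks = {
    "bed": (22, 14, 8),
    "breakfast": (3, 9, 9),
    "clean": (14, 6, 7),
    "friend": (9, 10, 11),
    "good": (5, 5, 6),
    "great": (7, 11, 5),
    "help": (6, 8, 13),
    "hotel": (12, 4, 4),
    "locat": (2, 2, 2),
    "lucca": (8, 12142, 16938),
    "nice": (10, 13, 15),
    "room": (1, 1, 1),
    "small": (41, 7, 12),
    "staff": (4, 3, 3),
    "time": (53, 35, 10),
}

# A term is shared if its worst rank across the three cities is within the top 15.
shared = sorted(term for term, r in ranks.items() if max(r) <= 15)
outside = sorted(set(ranks) - set(shared))

print(len(shared), shared)
print(outside)
```

The four terms that fall outside the top 15 in at least one city are "bed", "lucca", "small" and "time".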